in 
o 

(N 



Variational inference for generalized linear mixed 
models using partially noncentered parametrizations 

Linda S. L. Tan and David J. Nott 

Abstract 

The effects of different parametrizations on the convergence of Bayesian com- 
putational algorithms for hierarchical models are well explored. Techniques such as 
centering, noncentering and partial noncentering can be used to accelerate conver- 
gence in MCMC and EM algorithms but are still not well studied for variational 
Bayes (VB) methods. As a fast deterministic approach to posterior approximation, 

£C} VB is attracting increasing interest due to its suitability for large high-dimensional 

data. Use of different parametrizations for VB has not only computational but also 

CN statistical implications as different parametrizations are associated with different 

factorized posterior approximations. We examine the use of partially noncentered 
parametrizations in VB for generalized linear mixed models (GLMMs). Our paper 
makes four contributions. First, we show how to implement an algorithm called 
non-conjugate variational message passing for GLMMs. Second, we show that the 
partially noncentered parametrization can adapt to the quantity of information in 
the data and determine a parametrization close to optimal. Third, we show that 
partial noncentering can accelerate convergence and produce more accurate poste- 
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rior approximations than centering or noncentering. Finally, we demonstrate how 
the variational lower bound, produced as part of the computation, can be useful 
for model selection. 
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1 Introduction 



X 

The convergence of Markov chain Monte Carlo (MCMC) algorithms depends greatly on 
the choice of parametrization and simple reparametrizations can often give improved con- 
vergence. Here we investigate the use of centered, noncentered and partially noncentered 
parametrizations of hierarchical models in the context of variational Bayes (VB) (Attias, 
1999). As a fast deterministic approach to approximation of the posterior distribution 
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in Bayesian inference, VB is attracting increasing interest due to its suitability for large 
high-dimensional data (see, e.g., Braun and McAuliffe, 2010; Hoffman et al, 2012). VB 
methods approximate the intractable posterior by a factorized distribution which can 
be represented by a directed graph and optimization of the factorized variational poste- 
rior can be decomposed into local computations that involve only neighbouring nodes. 
Variational message passing (Winn and Bishop, 2005) is an algorithmic implementation 
of VB that can be applied to a general class of conjugate-exponential models (Attias, 
2000; Ghahramani and Beal, 2001). Knowles and Minka (2011) proposed an algorithm 
called non-conjugate variational message passing to extend variational message passing 
to non-conjugate models. 

We examine the use of partially noncentered parametrization in VB for generalized 
linear mixed models (GLMMs). Our paper makes four contributions. First, we show 
how to implement non-conjugate variational message passing for GLMMs. Second, we 
show that the partially noncentered parametrization is able to adapt to the quantity of 
information in the data so that it is not necessary to make a choice in advance between 
centering and noncentering with the data deciding the optimal parametrization. Third, we 
show that in addition to accelerating convergence, partial noncentering is a good strategy 
statistically for VB in terms of producing more accurate approximations to the posterior 
than either centering or noncentering. Finally, we demonstrate how the variational lower 
bound, which is produced as part of the computation, can be useful for model selection. 

GLMMs extend generalized linear models by the inclusion of random effects to account 
for correlation of observations in grouped data and are of wide applicability. Estimation 
of GLMMs using maximum likelihood is challenging as the integral over random effects 
is intractable. Methods involving numerical quadrature or MCMC to approximate these 
integrals are computationally intensive. Various approximate methods such as penalized 
quasi-likelihood (Breslow and Clayton, 1993), Laplace approximation and its extension 
(Raudenbush et al, 2000) and Gaussian variational approximation (Ormerod and Wand, 
2012) have been developed. Fong et al. (2010) considered a Bayesian approach using inte- 
grated nested Laplace approximations. We show how to fit GLMMs using non-conjugate 
variational message passing, focusing on Poisson and logistic mixed models and their 
applications in longitudinal data analysis. 

The literature on parametrization of hierarchical models including partial noncenter- 
ing techniques for accelerating MCMC algorithms is inspired by earlier similar work for 
the expectation maximization (EM) algorithm (see Meng and van Dyk, 1997, 1999; Liu 
and Wu, 1999). Gelfand et al. (1995, 1996) proposed hierarchical centering for normal lin- 
ear mixed models and GLMMs to improve the slow mixing in MCMC algorithms due to 
high correlations between model parameters. Papaspiliopoulos et al. (2003, 2007) demon- 
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strated that centering and noncentering play complementary roles in boosting MCMC 
efficiency and neither are uniformly effective. They considered the partially noncentered 
parametrization which is data dependent and lies on the continuum between the cen- 
tered and noncentered parametrizations. Extending this idea, Christensen et al. (2006) 
devised reparametrization techniques to improve performance for Hastings-within Gibbs 
algorithms for spatial GLMMs. Yu and Meng (2011) introduced a strategy for boosting 
MCMC efficiency via interweaving the centered and noncentered parametrizations to re- 
duce dependence between draws. Parameter-expanded VB methods were proposed by Qi 
and Jaakkola (2006) to reduce coupling in updates and speed up VB. 

The idea of partial noncentering is to introduce a tuning parameter via reparametriza- 
tion of the model and then seek its optimal value for fastest convergence. For the normal 
hierarchical model, Papaspiliopoulos et al. (2003) showed that the partially noncentered 
parametrization has convergence properties superior to that of the centered and noncen- 
tered parametrizations for the Gibbs sampler. As the rate of convergence of an algorithm 
based on VB is equal to that of the corresponding Gibbs sampler when the target distri- 
bution is Gaussian (Tan and Nott, 2012), partial noncentering will similarly outperform 
centering and noncentering in the context of VB for the normal hierarchical model. This 
provides motivation to consider partial noncentering in the VB context. We illustrate this 
idea with the following example. 

1.1 Motivating example: linear mixed model 

Consider the linear mixed model 

yi = Xtf + X iUi + e u q ~ iV(0,a 2 /), i = l,...,n, (1) 

where y^ is a vector of length n,, (3 is a vector of length r of fixed effects, Xi is a rii x r matrix 
of covariates and U{ is a vector of length r of random effects independently distributed 
as N(0,D). For simplicity, we specify a constant prior on (5 and assume a 2 and D are 
known. Let 

ctj = (3 + Ui and oti — a.i — Wi/3, i — l,...,n, 

where Wi is an r x r tuning matrix to be specified. Wi = corresponds to the centered 
and Wi — I to the noncentered parametrization. For each i — 1, ...,n, 

y i = X i W i l3 + X i a i + e i and ^ ~ TV ((/ - V^)A D) . 

This is the partially noncentered parametrization and the set of unknown parameters is 
6 = {(3, a} where a = [aj , <5^] T . Let y = [yi, y n ] T denote the observed data. Of 
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interest is the posterior distribution of 9, p(9\y). 

Suppose p(9\y) is not analytically tractable. In the variational approach, we approxi- 
mate p{0\y) by a q(9) for which inference is more tractable and q(9) is chosen to minimize 
the Kullback-Leibler divergence between q{9) and p{9\y) given by 

J ' q{mos M) iB = J f ' Mlos M #+loffW 

where p(y) is the marginal likelihood p(y) = J p{y\9)p(9) d9. Since the Kullback-Leibler 
divergence is non-negative, 



\ogp{y)> [ logP^± q (0) d9 



= E q {logp(y,9)}-E q {logq(9)} 

= £ (2) 

where £ is a lower bound on the log marginal likelihood. Maximization of C is equivalent 
to minimisation of the Kullback-Leibler divergence between q(6) and p{9\y). In VB, q(8) 
is assumed to be of a factorized form, say q{9) = YYiLi Qi{6i) f° r some partition {9i, 9 m } 
of 9. Maximization of C over each of qi, q m lead to optimal densities satisfying qi(9i) oc 
exp{E_g i \ogp(y, 9)}, i = l,...,m, where E_g. denotes expectation with respect to the 
density Ylj^iQji^j)- See Ormerod and Wand (2010) for an explanation of variational 
approximation methods very accessible to statisticians. 

If we apply VB to Q and approximate the posterior p(9\y) with q{9) = q((3)q(a), the 
optimal densities can be derived to be q((3) = N(fi 9 p, T:V) and q(a) = Yl7=i where 
q{ati) = N(fi q & ^, E| t ). The expressions for the variational parameters /i^, E^ and //|., E|., 
i = 1, ...,m, are however dependent on each other and can be computed by an iterative 
scheme such as that given in Algorithm 1. 

Initialize and E|. for i = 1, ...,n. 
Cycle: 

n <- e=i {(i - m T D-\i - + ^wrxfxm}]- 1 
4 <- n Eti [^w?xf yi + { D -\i - wo - j,xjxm T ni] 

For i = 1, ...,n, 

K^iD-' + ^XfX^ 1 

A <- [^Xj Vi + {D-\I - W t ) - ^XTXiWWfA 
until convergence. 

Algorithm 1: Iterative scheme for obtaining variational parameters in linear mixed model. 
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Observe that Algorithm 1 converges in one iteration if D 1 {I — Wi) = -^XfXjWi for 
each i, that is, if 



W l = {^XjXi + D- l Y l D-\ for i = 1, n. 



(3) 



For this specification of the tuning parameters, partial noncentering gives more rapid 
convergence than centering or noncentering. Moreover, it can be shown that the true 
posteriors are recovered in this partially noncentered parametrization so that a better fit 
is achieved than in the centered or noncentered parametrizations. This example suggests 
that with careful tuning of Wi, i = 1, n, the partially noncentered parametrization can 
potentially outperform the centered and noncentered parametrizations in the VB context. 

The rest of the paper is organized as follows. Section 2 specifies the GLMM and priors 
used. Section 3 describes the partially noncentered parametrization for GLMMs. Section 
4 describes the non-conjugate variational message passing algorithm for fitting GLMMs. 
Section 5 discusses briefly the use of the variational lower bound for model selection and 
Section 6 considers examples including real and simulated data. Section 7 concludes. 

2 The generalized linear mixed model 

Consider clustered data where y^- denotes the jth response from cluster i, i = l,...,n, 
j = l,...,rij. Conditional on the r-dimensional random effects Ui drawn independently 
from N(0,D), yij is independently distributed from some exponential family distribution 
with density 



where Qj is the canonical parameter, <f> is the dispersion parameter, and a(-), b(-) and 
c(-) are functions specific to the family. The conditional mean of y^, fiij = E(yij\ui), is 
assumed to depend on the fixed and random effects through the linear predictor, 



with 9(P „) = , te for some known link fnnction, ,(•). Here, Xg and X„ = [*«' , Xff 
are r x 1 and p x 1 vectors of covariates and f3 = [/3 rT ', f3° T } T is a p x 1 vector of fixed 
effects. We considered the above breakdown (see Zhao et ai, 2006) for the linear predictor 
to allow for centering. For the ith cluster, let yi = [yn, ■■■,ym i \ T , Xj* = [X^, X^] T , 
x ? = [^tt)-!^F, x i = [Xn,...,X in .] T and ^ = [rj n , rj ini ] T . Let l ni denote the 
Hi x 1 column vector with all entries equal to 1. We assume that the first column of Xf 





= x* 1 p R + xf p G + x* 1 Ui 
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is l ni if X R is not a zero matrix. For Bayesian inference, we specify prior distributions 
on the fixed effects (3 and random effects covariance matrix D. In this paper, we focus on 
responses from the Bernoulli and Poisson families and the dispersion parameter is one in 
these cases so we do not consider a prior for <fi. We assume a diffuse prior, N(0, S^), for 
[3 and an independent inverse Wishart prior, IW(v, S), for D. Following the suggestion 
by Kass and Natarajan (2006), we set v = r and let the scale matrix S be determined 
from first-stage data variability. In particular, S = rR where 



R = c(-j2xfMS)X* 

\ i=l J 



-1 



(5) 



Mi0) denotes the rii x rii diagonal generalized linear model weight matrix with diagonal 
elements [<f>v (fa) g' (fa) 2 ]' 1 , v(-] 



is the variance function based on /(.) in (|4|) and g(-) is 
the link function. Here, fa = g~ 1 (Xj-f3 + X RT Ui) where Ui is set as for all i and (3 is 
an estimate of the regression coefficients from the generalized linear model obtained by 
pooling all data and setting ui = for all i. The value of c is an inflation factor representing 
the amount by which within-cluster variability should be increased in determining R. We 
used c = 1 in all examples. 



3 A partially noncentered parametrization for the 
generalized linear mixed model 

We introduce the following partially noncentered parametrization for the GLMM. For 



each i — 1, n, the linear predictor is r/i = X^f3 R + X^f3 u + X^Ui. Let 



'GoG 



X«/3 U = Xf 1 /3 Gl +XY 2 f3 

= l ni xf lT /3 Gl +Xf 2 /3 G2 

where /3 Gl is a vector of length g\ consisting of all parameters corresponding to subject 
specific covariates (that is, the rows of X i 1 are all the same and equal to the vector xf 1 
say). Recall that the first column of X^ is l ni if Xf is not a zero matrix. We have 



0G2 



Xf{Cif3 RGl + Ui )+X^f3 G2 where Q 



Gi 



and (3 



(3 R 



Let «j = Ci(3 RGl + Ui and hi 



WiCi(3 RGl where Wi is an r x r matrix to be 



specified. The proportion of Ci(3 RGl subtracted from each «j is allowed to vary with i 
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Figure 1: Factor graph for p(y,9) in ([7]). Filled rectangles denote factors and circles 
denote variables (shaded for observed variables). Smaller filled circles denote constants 
or hyperparameters. The box represents a plate which contains variables and factors to 
be replicated. Number of repetitions is indicated in lower right corner. 



as in Papaspiliopoulos et al. (2003) to reflect the varying informativity of each response 
Hi about the underlying at*. Wi = corresponds to the centered and Wi — I to the 
noncentered parametrization. Finally, 

Vi = X?(ai + WiPiP™ 1 ) + Xf 2 /3 G2 

= V i p + X?& i (6) 

where V t = [XfWid Xf 2 ] and a { ~ N ((I — Wi)Ci(3 RGl , D) . We refer to (g) as the 
partially noncentered parametrization. Let a = [aj , a^] T and 9 = {(3,D,a} denote 
the set of unknown parameters in the GLMM. The joint distribution of p(y, 9) is 

P(V,9) = |np(%l^ 5 ^Mail^^)jKi9|^M^I^5'). (7) 

Figure [l] shows the factor graph for p(y, 9) where there is a node (circle) for every variable, 
which is shaded in the case of observed variables and a node (filled rectangle) for each 
factor in the joint distribution. Constants or hyperparameters are denoted with smaller 
filled circles. Each factor node is connected by undirected links to all of the variable nodes 
on which that factor depends (see Bishop, 2006). Next, we consider specification of the 
tuning parameter Wi, referring to the linear mixed model example in Section 1.1 which 
is a special case of the GLMM in Q with an identity link. 



3.1 Specification of tuning parameter 

It is interesting to note that for the linear mixed model in (JTJ), the expression for Wi leading 
to optimal performance in VB and the Gibbs sampling algorithm is exactly the same (see 
Papaspiliopoulos et al, 2003). Gelfand et al. (1995) also observed the importance of Wi 
in assessing convergence properties of the centered parametrization. They showed that 
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\Wi\ < 1 for all i and \ Wi\ is close to zero (centering is more efficient) when the generalized 
variance \D\ is large. On the other hand, \Wi\ is close to 1 (noncentering works better) 
when the error variance is large. Outside the Gaussian context, Papaspiliopoulos et al. 
(2003) considered partial noncentering for the spatial GLMM and specified the tuning 
parameters by using a quadratic expansion of the log-likelihood to obtain an indication 
of the information present in yi. Observe that Wi in (|3p can be expressed as 

Wi= {l f + D- i y l D-\ (8) 

if I = log p(yi\P, a>i) denotes the log-likelihood and Xf = — q^qt - We use (8) to extend 
partially noncentered parametrizations to GLMMs and consider the specification of Wi 
for responses from the Bernoulli and Poisson families in particular. 

Recall that the linear predictor rji can be expressed as X^oti + Xf 2 f3° 2 . For Poisson 
responses with the log link function, we allow for an offset log E^ so that log = 
log Eij + rjij. Let Ei = [E a , E irii ] T . We have 

t = yj (log Ei + th) - Ef exp(^) - 1 J log(^!) (9) 

and Tj =J2E ij eMVi j )X3 X f « E^ X 5^f 

0=1 3=1 

if we approximate the conditional mean /iy with the response. For Bernoulli responses 
with the logit link function, we have 

e = yIVi-llAog{l ni +eMm)} and If = jt {1 +Ipfa. )}2 ^f • ( 10 ) 

The specification of Wi depends on the random effects covariance D and for Bernoulli 
responses, on the linear predictor rji as well. In Algorithm 3, we initialize Wi by considering 
r/i = Xi/3 + Xfui and using estimates of D, (3 and Ui from penalized quasi-likelihood. 
Subsequently, we can either keep Wi as fixed or update them by replacing D with q ^ q x , 
assuming the variational posterior of D is lWiy q , S q ) and rji with ViiA + X^fi q & ., where 
and are the variational posterior means of /3 and <5j respectively. This can be done 
at the beginning of each iteration after new estimates of fii, /i?., v q and S q are obtained 
(see Algorithm 3 step 1). 



S 



4 Variational inference for GLMMs 



In this section, we present the non-conjugate variational message passing algorithm re- 
cently developed in machine learning by Knowles and Minka (2011) for fitting GLMMs. 
Recall that in VB, the posterior distribution p(6\y) is approximated by a q(9) which is 
assumed to be of a factorized form, say q(9) = YliLi Qi{@i) f° r some partition {9i, 9 m } 
of 9. For conjugate-exponential models, the optimal densities will have the same form 
as the prior so that it suffices to update the parameters of qi, such as in Algorithm 1. 
Variational message passing (Winn and Bishop, 2005) is an algorithm which allows VB to 
be applied to conjugate-exponential models without having to derive application-specific 
updates. In the case of GLMMs where the responses are from the Bernoulli or Poisson 
families, the factor p(yi\f3,&i) of p(y,9) in (JTl) is non-conjugate with respect to the prior 
distributions over /3 and cti for each i — 1, ...,n. Therefore, if we apply VB and assume 
say q{9) = q((3)q(D) YYi=i <?(«i), the optimal densities for q((3) and q(a.i) will not belong 
to recognizable density families. 



4.1 Non-conjugate variational message passing 

In non-conjugate variational message passing, besides assuming that q{9) must factorize 
into YULi^ii^i) f° r some partition {9i,...,9 m } of 9, we impose another restriction that 
each qi must belong to some exponential family. In this way, we only have to find the 
parameters of each q { that maximizes the lower bound C. Suppose each q { can be written 
in the form 

ft (0 f ) = exp{Xjt(9i) - h(Xi)} 

where Aj is the vector of natural parameters and £(•) are the sufficient statistics. We wish 
to maximize C with respect to the variational parameters Ai, X m which are also natural 
parameters of <7i(#i), qm(9 m ) respectively. In the following, we show that non-conjugate 
variational message passing can be interpreted as a fixed-point iteration where updates 
are obtained from the condition that the gradient of C with respect to each Aj is zero 
when £ is maximized. 

From ([2]) , the gradient of L with respect to \ is 

§- = £E q {\ogp(y,9)} - £E q {logq(9)}. (11) 



Let us consider the first term in (11 ). Suppose p(y, 9) = Y[ a fa(y, 9). We have 



E q {logp(y,9)} = Y,S a where S a = E q {log f a (y,9)}. 
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Note that each S a is a function of the natural parameters Ai, A m . Since we have assumed 
that 9i is independent of all 9j where j 7^ i in the variational approximation q, the only 
terms in J2 a which depend on A« are the factors f a connected to 9i in the factor graph 
of p(y, 9). Therefore, 

£:E q {lo g p(y,6)}= £ ft (12) 

aeN{e t ) 

where the summation is over all factors in N(9i), the neighbourhood of 9i in the factor 
graph. For the second term in (11), we have E q {\og q(9)} = Y^iLi Eq{^°gQi(8l)} where 
the only term in the sum that depends on A« is the zth term. Hence, 

^{loggW} = ft{Af^M-MA,)} 

= V(A 4 )A,, (13) 



d 



where we have used the fact that E q {t(9i)} = dh gy^ and V(Aj) = q^q^t denotes the 
variance- covariance matrix of t(9i). Note that V(Aj) is symmetric positive semi-definite. 



Putting (12) and (13) together, the gradient of the lower bound is 



£>A, 



and is zero when A, = V(Aj) 1 XLga^) ft' P rov ided V(Aj) is invertible. This condition 
is used to obtain updates to A, in non-conjugate variational message passing (Algorithm 
2). 



Initialize Aj for i = 1, m. 
Cycle: 

For i = 1, m, 

Xi^V (Ai) 1 J2aeN(9i) ft 

until convergence. 
Algorithm 2: Non-conjugate variational message passing. 

The updates can be simplified when the factor f a is conjugate to qi(9i), that is, f a has 
the same functional form as qi{9i) with respect to 0*. Let 6Lj = (9%, ...,9 m ). 
Suppose 

f a (y,9) = eM9a{y,0-i) T t{9i) - h a (y,9^)}. 

Then §f» = V(\i)E q {g a (y,9_i)}, where E q {g a (y,9_i)} does not depend on A*. When 
every factor in the neighbourhood of 9i is conjugate to qi(9i), the gradient of the lower 
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bound can be simplified to V(Aj) XLga^) E q {g a (y, 9-i)} — AjJ and the updates in non- 
conjugate variational message passing reduce to 

A,^ J2 E q {g a (y,9_ t )}. (14) 

a€N(8i) 

These are precisely the updates in variational message passing. Non-conjugate variational 
message passing thus reduces to variational message passing for conjugate factors (see also 
Knowles and Minka, 2011). Unlike variational message passing however, the Kullback- 
Leibler divergence is not guaranteed to decrease at each step and sometimes convergence 
problems may be encountered. Knowles and Minka (2011) suggested using damping to fix 
convergence problems. We did not encounter any convergence problems in the examples 
considered in this paper. 



4.2 Updates for multivariate Gaussian variational distribution 

Suppose qi is Gaussian. While the updates in Algorithm 2 are expressed in terms of the 
natural parameters A«, it might be more convenient to express ^ in terms of the mean 
and covariance of q^. Knowles and Minka (2011) have considered the univariate case and 
Wand (2013) derived fully simplified updates for the multivariate case. However, as Wand 
(2013) is in preparation, we give enough details of the update so that the derivation can be 
understood. Magnus and Neudecker (1988) is a good reference for the matrix differential 
calculus techniques used in the derivation. 

Suppose qi{6i) = N(fj,g , where 9i is a vector of length d. For a dxd square matrix 
A, vec(A) denotes the d 2 x 1 vector obtained by stacking the columns of A under each 
other, from left to right in order and vech(A) denotes the \d{d + 1) x 1 vector obtained 
from vec(A) by eliminating all supradiagonal elements of A. We can write qi{6i) as 



exp ^ A^ 



vech(0i0f) 



where A,- 



-l^ T vec(E« 

yq -1 q 



q 



and/i(Ai) = l/jL q e . T T, q . V^ + l lo £ l S <U + ! lo g( 27r )- The matrix D d is a unique d 2 x \d{d+l) 
matrix that transforms vech(A) into vec(A) if A is symmetric, that is, Ddvech(A) = 
vec(A). Let D\ denote the Moore-Penrose inverse of D d . If we let \n = —7 l Djvec(T l q 9 ~ 1 ] 
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and X i2 = V^j ff 4 can be expressed as 



\dS a ] 








dS a 









dvec(T,g) dp,g, 
dXn d Aji 



9A, 









95 a 




dvec(£«.) 
. 9 4 . 


= *7(A0 


9vcc(S«.) 



where U(Xi) = 



^ 

2 /i(Aj) 



and <g) denotes the Kronecker product. Moreover, V(Aj) = 9a .q A t can De derived to be 



2DMu* T ® Si + E« ® + E« ® E« )£>+*' 2DM ® El) 



,9 ,,9 T 



T 



q \\T 



{2D+(^ 4 ®E| 1 )} 
The update for A, can be computed as 

95g 



Aj <- v(x i y 1 u(x i ) 



<9vec(E«.) 
95 a ' 



and V(A i )- 1 f/(A i 



-2{f®I)DfDl I 



Wand (2013) showed that the updates simplify to 
1 r 



E . 4r- 



vcc 



1 £ 



1 and ^ «- ^ + S« V |%. (15) 



A more detailed version of the argument will be given in the forthcoming manuscript of 
Wand (2013). 



4.3 Non-conjugate variational message passing algorithm for 
generalized linear mixed models 

For the GLMM, we consider a variational approximation of the form 



q{9)=q{P)q{D)J{q{a i ) 



(16) 



i=i 



where q((3) is N (/A, E|), q(D) is IW(u q , S q ), and q(aii) is iV E|.), all belonging to 
the exponential family. Here, we approximate the posterior distributions of (3 and <Sj by 
Gaussian distributions which are often reasonable and supported by the asymptotic nor- 
mality of the posterior. Our results also indicate that Gaussian approximation performs 
reasonably well as an approximation to the posterior in finite samples. See Gelman et 
al. (2003) for further discussion and counterexamples. The posterior distribution for D is 
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approximated by an inverse Wishart which can be shown to be the optimal density under 
only the VB assumption q(8) = q((3)q(D)q(a). The non-conjugate variational message 
passing algorithm for GLMMs is outlined in Algorithm 3. 

Initialize /A, £|, S q and S|., Wi for i — 1, ...,n and set v q = n + v. 
Cycle: 

1. Update Wi and hence Vi for i — 1, n. (Optional) 

/ n _ n n i \ — 1 



i=l j=l 



i=i 



i=i 



3. For z = 1, n, 



v j=l 



until the absolute relative change in the lower bound C is negligible. 
Algorithm 3: Non-conjugate variational message passing for fitting GLMMs. 



In Algorithm 3, for each % = l,...,n, j = l,...,nj, Wi 



x(p-r-pi) 



«y is the jth component of k { = exp{^^ + + §diag(V-££V- T + AfS^Af )}, 

/ijj is the jth component of \ii = Vi/ii + Xf\i\^ Oij is the jth component of Oi = 

diag(V^Vf + X^l.Xf) and a) = ^(crx+^^e"* 2 where b(x) = 

log(l + e x ) and b^ r \x) denotes the rth derivative of b(-) with respect to x. If [i and a are 

rBM(l,4)' 



vectors, say ll 



1 

2 


and cr = 


4 

5 


3 




6 



then B^(fi,a) 



BM(2,5) 
BW(3,6) 



. In addition, 



if Poisson, 



(fiij , aij) if logistic, 



and G, 



£L Kj 



if Poisson, 



B^(fii,ai) if logistic, 



where a b denotes the element-wise product of two vectors, a and b. 

The updates in Algorithm 3 can be obtained from the formulae in (14) and (15). 
Consider the parameters v q and S q of q(D). The factors connected to D are p(D\u, S) 
and p(ai\(3,D), i — 1, ...,n, which are all conjugate factors. Therefore, updates for q(D) 



can be obtained from (14) or by setting q(D) oc exp{E^o log p(y, 9)} as in VB. The shape 
parameter v q can be shown to be deterministic: v q = n+u and the update for S q is given in 
step 4 of Algorithm 3. The updates of the parameters of q(f3) and q(ai), i — 1, n, have 
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to be computed using (15) as p(yi\/3, &i) connected to (3 and <5j is a non-conjugate factor. 
The factors connected to (3 are p(f3\T,p), p(eti\/3, D) and p(yi\/3, i — 1, n (see Figure 
§. Let S> = E q {\ogp{(3\T, p )}, S &i = E q {logp{ai\p, D)} and S yi = ^{log^l/?, 5*)}, 
i = 1, ...,n, where denotes expectation with respect to q. We have 

n n 

E dS a dSp \- dS &i \- 

dvec(E^) avec(S') Z^ 9vec(S«) Z^ 

ae7V(/3) i=l i=l 



n n 



3/$ "T Z^ 5^ ' 

a&N((3) i=l i=l 

and the simplified updates for E^ and are given in step 2 of Algorithm 3. The factors 
connected to are p(6ti\/3,D) and p(yi|/3, a*) for % = 1, n (see Figure [T]). Hence 



Hi 



E dSa _ g^gj , anr | \^ as^ _ , 

0vec(E|.) - 0vec(E«.) + 0vec(E«.) dI1U Z> 5^ ~ ^ + ^| . " 

The simplified updates for E|. and fi q & . are given in step 3 of Algorithm 3. See Appendix 
A for the evaluation of Sp, S &i and S yv All gradients can be computed using vector 
differential calculus (see Magnus and Neudecker, 1988). 

For responses from the Poisson family, S Vi can be evaluated in closed form. However, 
S Vi cannot be evaluated analytically for Bernoulli responses. Knowles and Minka (2011) 
discussed several alternatives in handling this integral. One could construct a bound on 
log(l + e x ) such as the 'quadratic' bound (Jaakkola and Jordan, 2000) or the 'tilted' 
bound (Saul and Jordan, 1999). We observed a negative bias in the estimates for the 
random effects variances when using the 'tilted bound' in Algorithm 3. This negative 
bias decreases as the cluster size increases (see also Rijmen and Vomlel, 2008). Hence, 
we use quadrature to compute the expectation and gradients. Following Ormerod and 
Wand (2012), we reduce all high-dimensional integrals to univariate ones and evaluate 
these efficiently using adaptive Gauss- Hermite quadrature (Liu and Pierce, 1994). The 
details are given in Appendix B. 

While the updates in Algorithm 1 can be simplified if Wi = I (noncentered) or 
(centered) and are more complex in the partially noncentered case, the reduction in 
efficiency is minimal. Moreover, with a good initialization, it is feasible to keep Wi as 
fixed throughout the course of running Algorithm 3 so that no additional computation 
time is used in updating Wi. We use the fit from penalized quasi-likelihood implemented 
via the function glmmPQL() in the R package MASS (Venables and Ripley, 2002) to 
initialize Algorithm 3. In our experiments, the lower bound computed at the end of 
each cycle of updates is usually on an increasing trend although there might be some 
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instability at the beginning. In cases where the algorithm does not converge, we found 
that changing the initialization can help to alleviate the situation. Although the lower 
bound is not guaranteed to increase at the end of each cycle, we continue to use it as 
a means of monitoring convergence and Algorithm 3 is terminated when the absolute 
relative change in the lower bound is less than 1CT 6 . The lower bounds for the logistic 
and Poisson GLMMs are presented in Appendix A. 

5 Model selection based on variational lower bound 

At the point of convergence of Algorithm 3, the lower bound on the log marginal like- 
lihood, logp(y), is maximized. This variational lower bound is often tight and can be 
useful for model selection. Bayesian model selection is traditionally based on computa- 
tion of Bayes factor in which marginal likelihood plays an important role. Suppose there 
are k candidate models, Mi, M k . Let p(Mj) and p(y\Mj) denote the prior probability 
and marginal likelihood of model Mj respectively. To compare any two models, say Mj 
and My, consider the posterior odds in favour of model Mf. 

p{Mj\y) = p(M)p(?/|M) 
p{Mj\y) p(M j )p(y\Mj)' 

The ratio of the marginal likelihoods, , is the Bayes factor and can be considered as 

the strength of evidence provided by the data in favour of model Mj over Mj. Therefore, 
model comparison can be performed using marginal likelihoods once a prior has been 
specified on the models. See O'Hagan and Forster (2004) for a review of Bayes factors and 
alternative methods for Bayesian model choice. In Section 6.4, we demonstrate how the 
variational lower bound, a by-product of Algorithm 3, can be can be used in place of the 
log marginal likelihood to obtain approximate posterior model probabilities, assuming 
all models considered are equally probable. Formerly, Corduneanu and Bishop (2001) 
verified through experiments and comparisons with cross-validation that the variational 
lower bound is a good score for model selection in Gaussian mixture models. 

We note that standard model selection criteria such as AIC or BIC arc difficult to 
apply to GLMMs as it is not straightforward to determine the degrees of freedom of 
a GLMM. Yu and Yau (2012) developed a conditional Akaike information criterion for 
GLMMs which takes into account estimation uncertainty in variance component param- 
eters. Overstall and Forster (2010) considered a default strategy for Bayesian model se- 
lection addressing issues of prior specification and computation. See also Cai and Dunson 
(2008) for a review of variable selection methods for GLMMs. 
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6 Examples 



We investigate the performance of Algorithm 3 using different parametrizations by con- 
sidering a simulation study and some real data sets. When using partial noncentering, 
we can either initialize the tuning parameters, Wi for i — 1, n, and keep them as fixed 
or update them at the beginning of each cycle (see Algorithm 3 step 1). Such updates 
are particularly useful when a good initialization is lacking. We present results for both 
cases. There might not be significant improvement in updating Wi in the examples below 
as the initialization using penalized quasi-likelihood is already good. 

We assessed the performance of Algorithm 3 using different parametrizations by using 
MCMC as a 'gold standard'. Fitting via MCMC was performed in WinBUGS (Lunn et 
al, 2000) through R by using R2WinBUGS (Sturtz et al, 2005) as an interface. WinBUGS 
automatically implements a Markov chain simulation for the posterior distribution after 
the user specifies a model and starting values (see, e.g. Gelman et al, 2003). We used the 
centered parametrization when specifying the model in WinBUGS as this produced better 
mixing than the noncentered parametrization for most of the examples considered (see 
Brown and Zhou, 2010). The MCMC algorithm was initialized similarly using the fit from 
penalized quasi-likelihood. In each case, three chains were run simultaneously to assess 
convergence, each with 50000 iterations, and the first 5000 iterations were discarded 
in each chain as burn-in. A thinning factor of 10 was applied to reduce dependence 
between draws. The posterior means and standard deviations reported were based on the 
remaining 13500 iterations. The computation times reported for MCMC are the times 
taken for updating in WinBUGS. We used the same priors for MCMC and Algorithm 3. 
For the fixed effects, we used a iV(0, 10007) prior. All code was written in the R language 
and run on a dual processor Windows PC 3.30 GHz workstation. 

6.1 Simulated data 

In this simulation study, we consider the Poisson random intercept model 



where Ui ~ N(0,a 2 ). For the Poisson random intercept model, we set = j — 1 for 
i = 1, ...,100, j = 1,2 and used f3 — fi\ — —0.5, a = 0.1. For the logistic random 



yij\ui ~ Poisson (exp(/3 + fax^ + ui)) 



and the logistic random intercept model 
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intercept model, we set Xij = |, for i = 1, ...,50, j = 1, ...,8, and used (3 — 0, /?i = 5, 
a = \/l.5. Similar settings have been considered by Ormerod and Wand (2012). For 
each model, 100 data sets were generated. No convergence issues were encountered for 
these simulated data but experience with other simulated data sets (not shown) indicate 
that problems may arise when the covariance matrix of the fixed effects estimated from 
penalized quasi-likelihood is nearly singular or when the standard deviation of the random 
effects are very close to zero. In such cases, we can use alternative means of initialization 
such as estimates from the generalized linear model obtained by setting the random 
effects as zero. The expression in ^ can also serve as a prior guess for D (see Kass 
and Natarajan, 2006). Table [l] reports the estimates from penalized quasi-likelihood and 
the posterior means and standard deviations estimated by Algorithm 3 (using different 
parametrizations) and MCMC. Results are averaged over the 100 sets of simulated data. 
We have also included root mean squared errors computed as yj ^ X/j=i(^ ~^i) 2 f° r an 
estimate di from the /th simulated data set obtained from penalized quasi-likelihood or 
Algorithm 3 where $° is the corresponding estimate from MCMC regarded as the 'gold 
standard'. 

For the Poisson model, the posterior means of the fixed effects and random effects 
estimated using the centered and noncentered parametrizations are quite close and also 
close to that of MCMC. However, the posterior standard deviations of the fixed effects 
are underestimated in the centered parametrization and the noncentered parametriza- 
tion does better. The average time to convergence was shorter with noncentering and a 
higher lower bound was attained on average. We observe that the partially noncentered 
parametrization where tuning parameters were not updated took on average the least 
time to converge and produced a fit closer to that of the noncentered parametrization 
but with improvements in the estimation of the posterior means of the random effects. 
When the tuning parameters were updated, the fit was just as good although computa- 
tion time was longer. For the logistic model, centering and noncentering have different 
merits. While centering produced better estimates of the posterior means, the posterior 
standard deviations of the fixed effects were underestimated. The partially noncentered 
parametrization tries to adapt between the centered and noncentered parametrizations, 
producing better estimates of the posterior means than noncentering and better esti- 
mates of the posteriorstandard deviations than centering. When the tuning parameters 
were updated, the results leaned more towards the noncentered parametrization and the 
algorithm took longer to converge. In both cases, Algorithm 3 using the partially non- 
centered parametrization was faster than MCMC and provided better estimates of the 
fixed effects and random effects than penalized quasi-likelihood. There are some difficul- 
ties however in comparing Algorithm 3 and MCMC in this way as the time taken for 
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Model Method /3 SE(/3 ) & SE(/3i) cr SE(cr) Time £ 



Poisson 



Logistic 



pClldllZCU 

quasi- 

1 1 L""f~i 1 1 It r\ r\ r\ 
11 Jvt- 11 11U U (_1 


-0.54 
(0.11) 


0.13 
(0.02) 


-0.48 
(0.01) 


0.19 
(0.03) 


0.27 
(0.35) 


— 


0.1 


— 


noncentered 


-0.63 
(ft m ~\ 


0.13 

(ft ftO\ 


-0.49 
!<-ft ftft'W 


0.21 
I vn ftftK\ 


0.48 
(ft 09"; 


0.03 
(ft ns"i 


3.6 


-196.0 


centered 


-0.63 
(0.01) 


0.05 
(0.10) 


-0.50 
(0.01) 


0.16 
(0.05) 


0.50 
(0.01) 


0.04 
(0.07) 


4.3 


-197.0 


T"\0 T*"T" TO 1 1 T 7" 

pdl LlcLll y 

noncentered: 
Wi fixed 


-0.63 
(0.01) 


0.13 
(0.02) 


-0.49 
(<0.005) 


0.20 
(0.01) 


0.49 
(0.01) 


0.03 
(0.08) 


3.5 


-196.0 


1~1 T*"f" 1(1 1 \\T 

\JdbL LlcLllj 

noncentered: 
Wj updated 


-0.63 
(0.01) 


0.13 
(0.02) 


-0.49 
(<0.005) 


0.19 
(0.02) 


0.49 
(0.01) 


0.03 
(0.08) 


4.0 


-196.0 


MCMC 


-0.64 


0.15 


-0.48 


0.21 


0.50 


0.11 


60.1 




penalized 

quasi- 
likcliliood 


-0.10 
(0.06) 


0.32 
(0.07) 


5.02 
(0.27) 


0.63 
(0.24) 


1.25 
(0.16) 




0.2 




noncentered 


-0.07 
(0.02) 


0.33 
(0.06) 


5.20 
(0.04) 


0.77 
(0.09) 


1.18 
(0.06) 


0.12 
(0.20) 


3.2 


-140.4 


centered 


-0.07 
(0.02) 


0.17 
(0.21) 


5.24 
(0.02) 


0.41 
(0.45) 


1.24 
(0.03) 


0.13 
(0.20) 


3.1 


-141.1 


partially 
noncentered: 
Wi fixed 


-0.07 
(0.02) 


0.30 
(0.09) 


5.23 
(0.02) 


0.50 
(0.37) 


1.22 
(0.03) 


0.12 
(0.20) 


2.9 


-140.5 


partially 
noncentered: 
Wi updated 


-0.07 
(0.02) 


0.30 
(0.08) 


5.21 
(0.04) 


0.50 
(0.36) 


1.22 
(0.04) 


0.12 
(0.20) 


3.9 


-140.5 



i upuaieu 

MCMC ^L05 038 5^23 085 L24 032 146J 



Table 1: Results of simulation study showing initialization values from penalized quasi- 
likelihood, posterior means and standard deviations estimated by Algorithm 3 (different 
parametrizations) and MCMC, computation times (seconds) and variational lower bounds 
(£), averaged over 100 sets of simulated data. Values in () are the corresponding root 
mean squared errors. 
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Algorithm 3 to converge depends on the initialization, stopping rule and the rate of con- 
vergence also depends on the problem. Similarly, the updating time taken for MCMC is 
also problem-dependent and depends on the length of burn-in and number of sampling 
iterations. In addition, we observed (in simulated data sets not shown) that posterior 
inferences can be sensitive to prior assumptions on the variance components in Poisson 
models where many of the counts are close to zero or in binary data where the cluster 
size is small (see Browne and Draper, 2006 and Roos and Held, 2011). 



6.2 Epilepsy data 

Here we consider the epilepsy data of Thall and Vail (1990) which has been analyzed 
by many authors (see, e.g., Breslow and Clayton, 1993, Ormerod and Wand, 2012). In 
this clinical trial, 59 epileptics were randomized to a new anti-epileptic drug, progabide, 
(Trt=l) or a placebo (Trt=0). Before receiving treatment, baseline data on the number 
of epileptic seizures during the preceding 8-week period were recorded. The logarithm of 
\ the number of baseline seizures (Base) and the logarithm of age (Age) were treated as 
covariates. Counts of epileptic seizures during the 2-weeks before each of four successive 
clinic visits (Visit, coded as Visiti = —0.3, Visit2 = —0.1, Visits = 0.1 and Visit4 = 0.3) 
were recorded. A binary variable (V4=l for fourth visit, otherwise) was also considered 
as a covariate. We consider models II and IV from Breslow and Clayton (1993). Model II 
is a Poisson random intercept model where 

log ^ = /3 + /3 B aseBasej + /3 Tr tTrtj + /3 Ba sexTrtBasei x Trtj + (3 Age Agei + /0v 4 V4y + «i 

for i = 1, ...,n, j = 1, ...,4 and Uj ~ iV(0,<r 2 ). Model IV is a Poisson random intercept 
and slope model of the form 

log fjLij = (3 Q + /3 B aseBasei + /3 Trt Trti + /^BasexTrtBase, x Trtj + f3 Agc Age i + /3 V isit Visits 
+ uu + M 2 iVisit ij - 



forz = l,..,n,j = l,...,4and[^]~jV 



ofl CT12 

(721 erj 2 



. As the MCMC chains for intercept 
and Age were mixing poorly, we decided to center the covariate Age. In the analysis that 
follows, we assume Age^ has been replaced by Age^ — mean(Age). 

Table [2] shows the estimates of the posterior means and standard deviations of the 
fits from MCMC and Algorithm 3 (using different parametrizations), initialization val- 
ues from penalized quasi-likelihood and computation times in seconds taken by different 
methods. All the variational methods are faster than MCMC by an order of magnitude 
which is especially important in large scale applications. In the noncentered parametriza- 
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penalized 






partially 


partially 






quasi- 


noncentered 


centered 


noncentered: 


noncentered: 


MCMC 




likelihood 






Wi fixed 


Wi updated 




Model II 
















0.31 ± 0.26 


0.26 ± 0.11 


0.27 ± 0.24 


0.27 ± 0.26 


0.27 ± 0.27 


0.26 ± 0.27 


/?Base 


0.88 ± 0.13 


0.89 ± 0.04 


0.88 ± 0.13 


0.88 ± 0.13 


0.88 ± 0.14 


0.89 ± 0.14 




-0.91 ± 0.41 


-0.94 ± 0.15 


-0.94 ± 0.36 


-0.94 ± 0.40 


-0.94 ± 0.41 


-0.94 ± 0.42 


/^BasexTrt 


0.34 ± 0.20 


0.34 ± 0.06 


0.34 ± 0.19 


0.34 ± 0.21 


0.34 ± 0.21 


0.34 ± 0.21 


HAgc 


54 ± 35 

V7 • • / 1 — 1 — VI • t J 'J 


50 ± 12 


48 ± 33 


48 ± 35 


48 ± 36 

V 7 . 1 v..J — 1 — V / ■ 1 7 V 


48 ± 37 

v / . i v ' — i — v7iU i 


B\TA 


-0 16 ± 08 


-0 16 ± 05 


-0 1 6 ± 05 


-0 16 ± 05 


-0 16 ± 05 

V 7 . 1 V7 — 1 — V ' . \ 7 7 


-0 1 6 ± 05 


(j 


0.44 


50 ± 05 


54 ± 05 

V7 . 17 j: _1_ \7 • V 7 17 


53 ± 05 

V ' - '.J f f — 1 — V / . \J ..7 


53 ± 05 

V7 - . 7 f f — 1 — V ' ■ V ! ..7 


53 ± 06 

VJiUfJ —1— V7 . v7 v7 


c 




-707.3 


-702.0 


-701.6 


-701.5 




Time 


0.2 


1.1 


0.4 


0.4 


0.6 


61 


Model IV 














A) 


0.27 ± 0.26 


0.21 ± 0.10 


0.21 ± 0.24 


0.21 ± 0.26 


0.21 ± 0.26 


0.21 ± 0.27 


/?Baso 


0.88 ± 0.13 


0.89 ± 0.04 


0.88 ± 0.13 


0.89 ± 0.13 


0.89 ± 0.13 


0.88 ± 0.14 


/?Trt 


-0.92 ± 0.41 


-0.94 ± 0.15 


-0.93 ± 0.36 


-0.93 ± 0.40 


-0.93 ± 0.40 


-0.94 ± 0.42 


/?BasoxTrt 


0.35 ± 0.20 


0.34 ± 0.06 


0.34 ± 0.19 


0.34 ± 0.20 


0.34 ± 0.21 


0.34 ± 0.22 


/?Age 


0.54 ± 0.35 


0.49 ± 0.12 


0.47 ± 0.32 


0.47 ± 0.35 


0.47 ± 0.35 


0.47 ± 0.37 


/Visit 


-0.28 ± 0.16 


-0.27 ± 0.10 


-0.27 ± 0.10 


-0.27 ± 0.14 


-0.27 ± 0.15 


-0.27 ± 0.17 


CTll 


0.45 


0.50 ± 0.05 


0.53 ± 0.05 


0.52 ± 0.05 


0.53 ± 0.05 


0.53 ± 0.06 


0-22 


0.46 


0.75 ± 0.07 


0.77 ± 0.07 


0.75 ± 0.07 


0.76 ± 0.07 


0.76 ± 0.15 


£ 




-701.4 


-696.1 


-695.3 


-695.1 




Time 


0.5 


1.5 


1.3 


1.2 


1.4 


122 



Table 2: Results for epilepsy data models II and IV showing initialization values from 
penalized quasi-likelihood, posterior means and standard deviations (values after ±) es- 
timated by Algorithm 3 (different parametrizations) and MCMC, computation times 
(seconds) and variational lower bounds (£). 

tion, the standard deviations of the fixed effects were underestimated and the centered 
parametrization does better in this aspect. The partially noncentered parametrization 
produced a fit that is closer to that of the centered parametrization and improved upon it. 
In both models, the fits produced by partial noncentering are very close to that produced 
by MCMC and are superior to that of the centered and noncentered parametrizations. 
The lower bound attained by partial noncentering is also higher than that of centering 
and noncentering, giving a tighter bound on the log marginal likelihood. It is important 
to emphasize that the relevant comparison is of the partially noncentered parametrization 
to the worst of the centered and noncentered parametrizations, since in general we do 
not know if centering or noncentering is better without running both algorithms. Partial 
noncentering on the other hand, automatically chooses a near optimal parametrization. 
Updating of the tuning parameters helped to improve the fit produced by partial noncen- 
tering. Figure [2] shows the marginal posterior distributions for parameters in models II 
and IV estimated by MCMC (solid line) and Algorithm 3 using the partially noncentered 
parametrization where tuning parameters are updated (dashed line). The variational pos- 
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Figure 2: Marginal posterior distributions for parameters in model II (first two rows) 
and model IV (last two rows) of the epilepsy data estimated by MCMC (solid line) and 
Algorithm 3 using partially noncentered parametrization where tuning parameters are 
updated (dashed line). 

terior densities of the fixed effects are very close to those obtained via MCMC. For the 
variance components, there is still some underestimation of the posterior variance. 

6.3 Toenail data 

This data set was obtained from a multicenter study comparing two competing oral 
antifungal treatments for toenail infection (De Backer et al, 1998), courtesy of Novoartis, 
Belgium. It contains information for 294 patients to be evaluated at seven visits. Not all 
patients attended all seven planned visits and there were 1908 measurements in total. The 
patients were randomized into two treatment groups, one group receiving 250 mg per day 
of terbinafine (Trt=l) and the other group 200 mg per day of itraconazole (Trt=0). Visits 
were planned at weeks 0, 4, 8, 12, 24, 36, and 48 but patients did not always arrive as 
scheduled and the exact time in months (t) that they did attend was recorded. The binary 
response variable (onycholysis) indicates the degree of separation of the nail plate from 
the nail-bed (0 if none or mild, 1 if moderate or severe). We consider the following logistic 
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penalized 






partially 


partially 






quasi- 


noncentered 


centered 


noncentered: 


noncentered: 


MCMC 




likelihood 






Wi fixed 


Wi updated 




A) 


-0.75 ± 0.25 


-1.41 ± 0.17 


-1.44 ± 0.29 


-1.44 ± 0.35 


-1.44 ± 0.32 


-1.65 ± 0.44 


fart 


-0.04 ± 0.35 


-0.13 ± 0.25 


-0.13 ± 0.41 


-0.13 ± 0.49 


-0.13 ± 0.45 


-0.17 ± 0.60 


Pt 


-0.30 ± 0.03 


-0.38 ± 0.04 


-0.38 ± 0.03 


-0.38 ± 0.03 


-0.38 ± 0.03 


-0.40 ± 0.05 


0Trt X Time 


-0.10 ± 0.05 


-0.13 ± 0.06 


-0.13 ± 0.04 


-0.13 ± 0.04 


-0.13 ± 0.04 


-0.14 ± 0.07 


a 


2.32 


3.52 ± 0.15 


3.56 ± 0.15 


3.55 ± 0.15 


3.55 ± 0.15 


4.10 ± 0.39 


C 




-664.1 


-663.1 


-662.7 


-662.9 




Time 


2.8 


37.9 


27.9 


26.0 


24.1 


1072 



Table 3: Results for toenail data showing values used for initialization from penalized 
quasi-likelihood, posterior means and posterior standard deviations (values after ±) from 
Algorithm 3 (different parametrizations) and MCMC, computation times (seconds) and 
variational lower bounds (£). 

random intercept model, 

logit(/%) = A) + /^TrtTrtj + 0^ + (3 Trtxt Trti x + u u 

where u { ~ JV(0, a 2 ) for % = 1, 294, 1 < j < 7. 

Table [3] shows the posterior means and standard deviations of the fits from MCMC 
and Algorithm 3 (using different parametrizations), initialization values from penalized 
quasi-likelihood and computation time in seconds taken by different methods. Again 
the VB methods are faster than MCMC by an order of magnitude. In this example, 
centering produced a better fit than noncentering and partial noncentering produced a fit 
closer to that of the centered parametrization but improving it. Partial noncentering also 
took less time to converge and attained a lower bound higher than that of the centered 
and noncentered parametrizations. Again, we emphasize that it is not easy to know 
beforehand which of centering or noncentering will perform better, and a big advantage 
of partial noncentering is the way that it automatically chooses a good parametrization. 
In this example, updating the tuning parameters did not result in a better fit although 
the time to convergence is reduced. The marginal posterior distributions estimated by 
MCMC (solid line) and Algorithm 3 using the partially noncentered parametrization 
where tuning parameters were not updated (dashed line) are shown in Figure [3j Compared 
with the MCMC fit, there is still some underestimation of the variance of the fixed 
effects particularly for the parameters which could not be centered. Although the partially 
noncentered parametrization has improved the estimation of the random effects from the 
initial penalized quasi-likelihood fit, there is still some underestimation of the mean and 
variance of the random effects when compared to the MCMC fit. 
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Figure 3: Marginal posterior distributions for parameters in toenail data estimated by 
MCMC (solid line) and Algorithm 3 using partially noncentered parametrization (tuning 
parameters not updated) (dashed line). 
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-3.05 ± 0.13 


-3.29 ± 0.25 


/3a S c 


-0.24 ± 0.08 


-0.22 ± 0.07 


-0.21 ± 0.02 


-0.22 ± 0.07 


-0.22 ± 0.07 


-0.25 ± 0.16 




2.52 


2.16 ± 0.07 


2.16 ± 0.07 


2.16 ± 0.07 


2.16 ± 0.07 


2.48 ± 0.24 


C22 


1.19 


0.55 ± 0.02 


0.56 ± 0.02 


0.55 ± 0.02 


0.55 ± 0.02 


0.61 ± 0.10 


C 




-833.2 


-834.1 


-832.8 


-832.6 




Time 


3.8 


114.7 


125.8 


110.6 


120.6 


1010 



Table 4: Results for six cities data showing values used for initialization from penalized 
quasi-likelihood, posterior means and posterior standard deviations (values after ±) from 
Algorithm 3 (different parametrizations) and MCMC, computation times (seconds) and 
variational lower bounds (£). 



6.4 Six cities data 

In the previous two real data examples, centering performed better than noncentering and 
partial noncentering was able to improve on the centering results. While centering often 
performs better than noncentering, we use this example to show that partial noncentering 
will automatically tend towards noncentering when noncentering is preferred. We consider 
the six cities data in Fitzmaurice and Laird (1993), where the binary response variable 
i/ij indicates the wheezing status (1 if wheezing, if not wheezing) of the ith child at 
time-point j, i — 1, 537, 2, 3,4. We use as covariate the age of the child at time- 
point j, centered at 9 years (Age) and consider the following random intercept and slope 
model 

logit(//ij) = /3 + /^AgeAge^ + u u + u 2i kge i 



ofi <T 12 
CT21 CT| 2 



. This model has been considered 



for i = 1, ...,537, j = 1, 4 and [£; ] ~ iV (0, 
in Overstall and Forster (2010). 

Table [4] shows the estimates of the posterior means and standard deviations of the fits 
from MCMC and Algorithm 3 using different parametrizations, the values from penalized 
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quai-likelihood used for initialization and the computation times in seconds taken by 
different methods. Noncentering performed better than centering in this case with a 
shorter time to convergence, higher lower bound and a better estimate of the posterior 
standard deviation of /?Agc- Partial noncentering further improved upon the results of 
noncentering with an improved estimate of the posterior standard deviation of (3 and 
faster convergence. All the variational methods are again faster than MCMC by an order 
of magnitude. 

6.5 Owl data 

In this example we illustrate the use of the variational lower bound, a by-product of 
Algorithm 3, for model selection. For MCMC, on the other hand, it is not straightforward 
in general to get a good estimate of the marginal likelihood based on the MCMC output. 
It is also not always obvious how to apply standard model selection criteria like AIC and 
BIC to hierarchical models like GLMMs. 

Roulin and Bersier (2007) analysed the begging behaviour of nestling barn owls and 
looked at whether offspring beg for food at different intensities from the mother than 
father. They sampled n = 27 nests and counted the number of calls made by all offspring 
in the absence of parents. Half of the nests were given extra prey, and from the other 
half, prey were removed. Measurements took place on two nights, and food treatment was 
swapped the second night. The number of measurements at each nest ranged from 4 to 
52 with a total of 599. We use as covariates sex of parent (Sex=l if male, if female), the 
time at which a parent arrived with a prey (t), and food treatment (Trt = 1 if 'satiated', 
if 'deprived'). The number of nestlings per nest (broodsize, E) ranged from 1 to 7. 

Zuur et al. (2009) modelled the number of calls at nest i for the jth observation as a 
Poisson distribution with mean fiij and used log transformed broodsize as an offset with 
nest as a random effect. The prime aim of their analysis was to find a sex effect and the 
largest model they considered was 

1. log(/iy) = log(Ey) + A) + /0s«Sexij + ArrtTrty + faUj + P Sex x Trt SeXjj X Trtjj 
+/3scxxt Sexjj x tij + Ui 

where log(Eij) is an offset and Ui ~ N(0,cr 2 ) for % = 1, ...,27, j = 1, ...,rij. At the 
recommendation of Zuur et al. (2009), we center t to reduce correlation of t with the 
intercept. Henceforth, we assume has been replaced by — mean(t). In the first stage, 
we consider models 1 to 4 and determine if the two interaction terms should be retained. 
Models 2 to 4 are as follows: 

SexxTrt SeXjj X Trtjj + Ui, 
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partially 


partially 




noncentered 


centered 


noncentered: 


noncentered: 








Wi fixed 


Wi updated 


First stage 


Model 1 


-2544.6 (0.2) 


-2543.7 (0.3) 


-2543.6 (0.4) 


-2543.7 (0.6) 


Model 2 


-2537.6 (0.2) 


-2536.6 (0.3) 


-2536.6 (0.4) 


-2536.6 (0.5) 


Model 3 


-2540.2 (0.2) 


-2539.2 (0.3) 


-2539.2 (0.3) 


-2539.2 (0.5) 


Model 4 


-2533.2 (0.2) 


-2532.1 (0.3) 


-2532.1 (0.3) 


-2532.1 (0.4) 


Second stage 


Model 5 


-2527.0 (0.2) 


-2525.5 (0.2) 


-2525.5 (0.2) 


-2525.4 (0.3) 


Model 6 


-2628.3 (0.2) 


-2627.2 (0.3) 


-2627.1 (0.3) 


-2627.1 (0.5) 


Model 7 


-2664.0 (0.2) 


-2662.9 (0.2) 


-2662.8 (0.3) 


-2662.8 (0.4) 


Third stage 


Model 8 


-2621.5 (0.2) 


-2620.0 (0.2) 


-2620.0 (0.2) 


-2620.0 (0.3) 


Model 9 


-2660.4 (0.2) 


-2658.8 (0.2) 


-2658.8 (0.2) 


-2658.8 (0.2) 


Model 10 


-2689.4 (<0.05) 








Final stage 


Model 11 


-2448.7 (1.1) 


-2445.7 (0.4) 


-2445.8 (0.3) 


-2445.6 (0.4) 



Table 5: Variational lower bounds for models 1 to 11 and computation time in brackets. 

3. l0g(/%) = log(Ey) + O + /3 S exSeXy + /^TrtTrty + fytij + /3 Sex xt SeXjj X tij + Ui, 

4. log(/iy) = log(Eij) + O + pSexSexij + 0Tvt^Uj + 0tUj + «i- 

From Table |5j the preferred model (with the highest lower bound) is model 4 where both 
interaction terms have been dropped from model 1. Next, we consider models 5 to 7 where 
the main terms sex, food treatment and arrival time are each dropped in turn, 

5. log(/i ij ) = log(E i:j ) + O + /Witty + PtUj + Ui, 

6. log(yUy) = log(Eij) + p Q + (3 Tlt Titij + /3 Sex Sex ij + w 4 , 

7. log(/%) = log(Eij) + (3 + (3 t tij + f3 Sex Sexij + u f . 

Table [5] indicates that model 5 is the preferred model where the term sex of the parent has 
been dropped from model 4. Now we consider dropping each of the terms food treatment 
and arrival time in turn or dropping the random effects Ui, 

8. log(/%) = log(E i:j ) + (3 + /^TrtTrty + u h 

9. log(/iy) = log(-Ey) + 0o + (3 t Uj + Ui, 

10. l0g(/iy) = l0g(E i:j ) +(3 + 0Trt^Uj + 0tUj- 

Table [5] indicates that none of the main terms food treatment and arrival time as well as 
random effects should be dropped from model 5. Finally we consider adding a random 
slope for arrival time, 
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penalized 






partially 


partially 






quasi- 


noncentered 


centered 


noncentered: 


noncentered: 


MCMC 




likelihood 






Wi fixed 


Wi updated 




Po 


0.60 ± 0.07 


0.53 ± 0.02 


0.51 ± 0.08 


0.51 ± 0.08 


0.51 ± 0.09 


0.50 ± 0.10 


pTrt 


-0.55 ± 0.08 


-0.57 ± 0.03 


-0.57 ± 0.03 


-0.57 ± 0.03 


-0.57 ± 0.03 


-0.57 ± 0.04 


A 


-0.13 ± 0.03 


-0.15 ± 0.01 


-0.16 ± 0.04 


-0.16 ± 0.04 


-0.16 ± 0.04 


-0.16 ± 0.05 


CTll 


0.24 


0.44 ± 0.06 


0.46 ± 0.06 


0.45 ± 0.06 


0.46 ± 0.06 


0.47 ± 0.09 


022 


0.11 


0.22 ± 0.03 


0.23 ± 0.03 


0.22 ± 0.03 


0.23 ± 0.03 


0.23 ± 0.05 


Time 


0.4 


1.1 


0.4 


0.3 


0.4 


255 



Table 6: Results for owl data (model 11) showing values used for initialization from 
penalized quasi-likelihood, posterior means and standard deviations (values after ±) from 
Algorithm 3 (different parametrizations) and MCMC and computation times (seconds). 



11. log(/%) = log(^) + (3 + /^TrtTrty + PtUj + U U + U 2 iUj, 

the optimal model is model 11. This 



where [^] ~ N ( 



ffj cr 12 



From Table 

conclusion is similar to that of Zuur et al. (2009) and is the same regardless of which 
parametrization was used. It is thus sufficient to consider just the partially noncentered 
parametrization. The computation time taken by Algorithm 3 for each model fitting is 
very short and makes this a convenient way of carrying out model selection or for nar- 
rowing down the range of likely models. Further model comparisons can be performed 
using cross-validation or other approaches. 

We present the estimated posterior means and standard deviations for the optimal 
model in Table [6j The marginal posterior distributions estimated by MCMC (solid line) 
and Algorithm 3 using partially noncentered parametrization where tuning parameters 
are updated (dashed line) are shown in Figure |ij In this case, centering produced a 
better fit than noncentering and partial noncentering produced a fit that is close to that 
of centering. Updating the tuning parameters helped to improve the fit of the partially 
noncentered parametrization slightly and is closest to the MCMC fit. From the posterior 
density plots, there is good estimation of the posterior means by Algorithm 3 using 
partially noncentered parametrization with updated tuning parameters but there is still 
some underestimation of the posterior variance. 



7 Conclusion 

In this paper, we described a partially noncentered parametrization for GLMMs and 
compared the performance of different parametrizations using an algorithm called non- 
conjugate variational message passing developed recently in machine learning. Focusing 
on Poisson and logistic mixed models, we applied our methods to analysis of longitudinal 
data sets. For the logistic model, some parameter updates were not available in closed 
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0.0 0.4 0.8 -0.4 -0.1 -0.70 -0.50 0.0 0.4 0.8 0.00 0.10 

Intercept Arrival time Trt Variance comp 1 Variance comp 2 



Figure 4: Marginal posterior distributions for parameters in model 11 (owl data) esti- 
mated by MCMC (solid line) and Algorithm 3 using partially noncentered parametriza- 
tion where tuning parameters are updated (dashed line). 

form and we used adaptive Gauss-Hermite quadrature to approximate the intractable 
integrals efficiently Comparing the performance of Algorithm 3 under the partially non- 
centered parametrization with that of the centered and noncentered parametrizations, we 
observed that partial noncentering automatically tends towards the better of centering 
and noncentering so that it is not necessary to choose in advance between the centered and 
noncentered parametrizations. In many cases, the partially noncentered parametrization 
was able to improve upon the fit produced by the better of centering and noncenter- 
ing to produce a fit that was closest to that of MCMC. In terms of computation time, 
the partially noncentered parametrization can also provide more rapid convergence when 
centering or noncentering is particularly slow. Very often, the lower bound attained by 
the partially noncentered parametrization is also higher than that of the centered and 
noncentered parametrizations giving a tighter lower bound to the log marginal likelihood. 
To some degree, the partially noncentered parametrization also alleviates the issue of un- 
derestimation of the posterior variance leading to some improvement in the estimation of 
the posterior variance particularly in the fixed effects which could be centered. Algorithm 
3 under the partially noncentered parametrization thus offers itself as a fast, deterministic 
alternative to MCMC methods for fitting GLMMs with improved estimation compared to 
the centered and noncentered parametrizations. We also demonstrate that the variational 
lower bound produced as part of the computation in Algorithm 3 can be useful in model 
selection. 
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Appendix A: Evaluating the variational lower bound 

From Q, Q and Q, 



£ = J2 S v> + E S *< + S f> + E q {\ogp{D\v, S)} - E q {\og q{f3)} - ]T E q {\og q(a t )} 

i=l i=l i 

- E q {logq(D)}. 
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To evaluate the terms in the lower bound, we use the following two lemmas which we 
state without proof: 

Lemma 1. Suppose pi(x) = iV(/i 1; Si) andp 2 (x) = N(fi 2 , E 2 ) where x is a p-dimensional 
vector, then / p 2 (x) logpi(x) dx = — | log(27r) - \ log |Ej| - §(// 2 - A i i) T S^ 1 ( / u 2 - /Ui) - 
|tr(Sr 1 S 2 ). 

Lemma 2. Suppose p(-D) = IW(v, S) where D is a symmetric, positive definite r x r 
matrix, then / p(D) log \D\ dD = log |5| - JJi=i i> (^p) ~ r lo S 2 and / p(D)D- 1 dD = 
vS~ l where ?/>(•) denotes the digamma function. 

Using these two lemmas, we can compute most of the terms in the lower bound. 

= -J log(27r) - | log IS,) - - |tr(E^EJ) 

• S &1 = Jq({3)q(D)q(a i )\ogp(a i \P,D)df3dDda i 

= -|log(27r) - | {log |^| - £[ =1 V> (^) -Hog 2} 



2 



• E q {logp(D\u, S)} = J q(D) logp(D\u, S) dD 

= -ftr(S^S) - Ifc=i> log(Tr) - ELi logr (^) + I log |5| 
_h^±i { log _ ^ (*^±i) _ r log 2} - f log 2 

. £ g {logg(/3)} = /g(/3)logg(/3)d/3 

= -flog(27r)-ilog|Ej|-jj 

• E q {\ogq(ai)} = f q(a { ) log q(ai)dai 

= -|tog(27r)-|log|E| < |-§ 

• £^lo gg (D)} = / 9 (D) log <?(£>) d£> 

= -^io g 2 - ^ logvr - zit logr (^) + f log \s<\ 

{log - ELi ^ (^) - r log2} - f 
The only term left to evaluate is 

S Vi= / q(f3)q(ai)fogp(yi\/3,a>i)df3da>i. 



For Poisson responses with the log link function (see (J9J)), 

S yi = wf{]og(30 + V^} + - - 1^ Iog(wO 



32 



where m = exp{V r i //J + + §diag(K£^ T + Xf-T^X? 1 )}. For Bernoulli responses 

with the logit link function (see (10)), 

m 

S Vi = vI(V it i$ + Xf/4) -J^E q [log{l + exp(V;J/3 + xfa,)} 



3=1 



where E n 



log{l + exp(V?f3 + Xfj T ai)} is evaluated using adaptive Gauss-Hermite quadra- 



ture (see Appendix B). The variational lower bound is thus given by 



i=l 



i=l 



+ ^iog | S | - ± log r (^^) + E log r (^±|^) 

{=1 ^ ' 1=1 ^ ' 



v + 1 — l\ p + nr nr , 

+ 1 log 2. 

2 / 2 2 6 



Note that this expression is valid only after each of the parameter updates has been made 
in Algorithm 3. 



Appendix B: Gauss-Hermite quadrature for logistic 
mixed models 



n 



We want to evaluate E q {b{V? (3 + Xg T on)} where b(x) = log(l + e x ) for each i — 1, 
and j = 1, n,. Let ^ = Vg>« + xfp% and a% = V^EjVy + Xf^Xg. Following 
Ormerod and Wand (2012), we reduce i? g {&(V^J/3 + X^ T aj)} to a univariate integral such 
that 

J — oo 

where 0(x; /i, a) denotes the Gaussian density for a random variable x with mean /i and 
standard deviation a. Let B^ r \fi, a) = J™ b^ r \ax + /i)0(x; 0, 1) dx where U r \x) denotes 
the rth derivative of &(•) with respect to x. If /i and cr are vectors, say /i = 



a 



then B^(/i,a) 



B< r )(l,4) 
B< r )(2,5) 
£M(3,6) 

Vifil + Xf[i\ and a, t = (a a , <7 in J T : 



and 



. For each cluster i, let \ii = (fin, /ij n J J 



We evaluate B^(fiij,aij) using adaptive Gauss-Hermite quadrature (Liu and Pierce, 
1994) for each i = 1, ...n, j = 1, and r = 0,1,2. Ormerod and Wand (2012) 
has considered a similar approach. In Gauss-Hermite quadrature, integrals of the form 
JZ° f( x ) e ~ x2 dx are approximated by YlT=i w kf{xk) where m is the number of quadra- 
ture points, the nodes Xi are zeros of the mth order Hermite polynomial and Wi are 
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suitably corresponding weights. This approximation is exact for polynomials of degree 
2m — 1 or less. For low-order quadrature to be effective, some transformation is usually 
required so that the integrand is sampled in a suitable range. Following the procedure 
recommended by Liu and Pierce (1994), we rewrite B^^j, a^) as 



B (r) 



r 

J — c 



b^{oijX + Hij)(fi(x; 0, 1) 



<f)(x; fiij^i^dx 



■t 



e x2 b ( - r \a ij (p, ij + V^cr^x) + Hij)(t)(jlij + \/2&ijX] 0, 1)J e ^ dx 
which can be approximated using Gauss-Hermite quadrature by 

m 
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For the integrand to be sampled in an appropriate region, we take p,^ to be the mode of 
the integrand and to be the standard deviation of the normal density approximating 
the integrand at the mode, so that 

fiij = argmax {b^ r \a i:j x + /J,ij)<f>(x; 0, 1)} 



d 2 



dx 2 



log {b^^x + Hij)(t>{x] 0, 1) } 



for j = l,...,rii and i = l,...,n. For computational efficiency, we evaluate /t^ and 
i = 1, ...,n, j = 1, ...,rii for the case r = 1 only once in each cycle of updates and use 
these values for r = 0, 2. No significant loss of accuracy was observed in doing this. We 
implement adaptive Gauss-Hermite quadrature in R using the R package f astGHQuad 
(Blocker, 2011). The quadrature nodes and weights can be obtained via the function 
gaussHermiteData ( ) and the function aghQuad ( ) approximates integrals using the 
method of Liu and Pierce (1994). We used 10 quadrature points in all the examples. 
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